Day 2 – Scraping lab
SICSS 2025
This lab and the solutions are available here: https://github.com/IAS-LiU/SICSS-2025.
Welcome to the lab material of the scraping workshop of SICSS-IAS 2025!
In this lab, we are going to see how to access data from the web. There are mainly two ways to do this: either by reading the HTML code behind a web page, or by accessing the data through an API.
In both cases, we extract data in formats that are not directly readable by R, so we need to be able to convert them into more convenient formats.
During the lab, you will need to load some libraries. You can install and load them with the following code:
list.of.packages <- c("rvest", "httr2", "cli", "stringr", "purrr", "dplyr", "ggplot2",
                      "lubridate", "igraph", "tidygraph", "ggraph", "tidytext", "tidyr", "collapse") # the later network and text-analysis sections use these too
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
library(rvest) #For HTML extraction
library(httr2) #For APIs Requests and Process the Responses
library(cli) #For command line interfaces
library(stringr) #For string manipulation
library(purrr) #For list manipulation
library(dplyr) #For data manipulation
library(ggplot2) #For plotting
Things we are not covering today
Web scraping encompasses a lot of methods, and today we’ll be
focusing on simple cases. We will not cover user simulation (check
rvest::html_session), RSelenium, or forms
(check rvest::html_form_set). There is a ton of
documentation on those if you want to dig more!
01. Scraping the web from scratch
Since most of the web is written in HTML, scraping the web from scratch requires knowing a little bit of HTML. So here’s an HTML 101:
HTML 101
This section is taken from Felix Lennert’s CSS Toolbox bookdown.
Web content is usually written in HTML (Hyper Text Markup Language). An HTML document is made up of elements that determine how its content appears.
The way these elements look is defined by so-called tags.
The opening tag is the name of the element (p in this case) in angle brackets, and the closing tag is the same with a forward slash before the name. p stands for a paragraph element and would look like this (since RMarkdown can handle HTML tags, the second line will showcase how it would appear on a web page):
<p> My cat is very grumpy. </p>
My cat is very grumpy.
The <p> tag makes sure that the text is standing
by itself and that a line break is included thereafter:
<p>My cat is very grumpy</p>. And so is my dog.
would look like this:
My cat is very grumpy
. And so is my dog.
There are many types of tags indicating different kinds of elements (about 100). Every page must be in an <html> element with two children, <head> and <body>. The former contains the page title and some metadata, the latter the contents you are seeing in your browser.
So-called block tags, e.g., <h1>
(heading 1), <p> (paragraph), or
<ol> (ordered list), structure the page.
Inline tags (<b> – bold,
<a> – link) format text inside block tags.
You can nest elements, e.g., if you want to make certain things bold, you can wrap text in <b>:
<p>My cat is <b>very grumpy</b>.</p>
My cat is very grumpy.
Then, the <b> element is considered the child of the <p> element.
Elements can also bear attributes. Those attributes will not appear in the actual content. Moreover, they are super-handy for us as scrapers. Consider, for instance:
<p class="editor-note">My cat is very grumpy.</p>
Here, class is the attribute name and "editor-note" the value. Another important attribute is id. Combined with CSS, they control the appearance of the element on the actual page. A class can be used by multiple HTML elements, whereas an id is unique.
Read a webpage in R
To read a webpage, we can use the rvest and xml2 packages. xml2::read_html reads an HTML page from a URL or an HTML file.
Here’s a depiction of the usual workflow:
Let’s start with a minimal example:
library(rvest) #Also loads xml2
html_example <- minimal_html('
<html>
<head>
<title>Page title</title>
</head>
<body>
<h1 id="first">A heading</h1>
<p class="important">Some important text; <b>some bold text.</b></p>
<h1>A second heading</h1>
<p id="link-sentence"> Another less important text that includes a <b><a href="https://example.com">link</a></b>. </p>
<h2 class="important">Another heading</h2>
Text outside a paragraph, but with <a href="https://example.com">another link</a>.
</body>
')
HTML pages are complex, and even in a simple example like the one above, it can be hard to navigate and to retrieve the necessary information. This is where CSS selectors and XPath come to the rescue! CSS selectors and XPath are two different ways to access information on HTML pages. Today, we will only cover CSS selectors, but know that XPath exists. It is a little bit more verbose, but it can be much more efficient.
CSS selectors
This section was partly taken from rvest’s
documentation.
CSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.
CSS selectors can be quite complex, but fortunately you only need the simplest for rvest, because you can also write R code for more complicated situations. The most important selectors are:
- p, a: selects all <p> and <a> elements.
- .title: selects all elements with class “title”.
- p.special: selects all <p> elements with class “special”.
- #title: selects the element with the id attribute that equals “title”. Id attributes must be unique within a document, so this will only ever select a single element.
- p b: selects all <b> elements nested in <p> elements.
- [hello]: selects all elements with a hello attribute.
Check here for more!
If you don’t know exactly what selector you need, I highly recommend using SelectorGadget, which lets you automatically generate the selector you need by supplying positive and negative examples in the browser.
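Before the exercises, here is a tiny self-contained sketch of these selectors in action. The snippet and its class names are made up for illustration:

```r
library(rvest)  # minimal_html() and html_elements() come from rvest

page <- minimal_html('
  <p class="special">Some <b>bold</b> text.</p>
  <p>Plain text with a <a href="https://example.com">link</a>.</p>
')

html_elements(page, "p")         # both paragraphs
html_elements(page, "p.special") # only the paragraph with class "special"
html_elements(page, "p b")       # the <b> nested in a <p>
html_elements(page, "[href]")    # any element with an href attribute
```

Each call returns an xml_nodeset, which you can then feed into the extraction functions introduced in the next exercise.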
Exercise 1
- Using rvest::html_elements, select all headings from html_example.
- Select all elements with class “important”.
- Select all elements with id “first”.
- Select all headings with class “important”.
- Select all title and p elements.
- Select all hyperlink elements that are nested in a paragraph.
- Select all elements with an id.
Exercise 2
The next step is to extract the text from the selected HTML elements. To get the text inside a tag, use html_text or html_text2 (which also collapses extra whitespace). To get the text inside an attribute, use html_attr.
From html_example:
- Get all the text.
- Get all the links.
Exercise 3 – Scraping a web page
In this exercise, we are going to scrape a Wikipedia page.
- Read the page.
- Extract the table from the “Breeds” section (Hint: check html_table).
- Get the names of all breeds and the URL to their Wikipedia page. Use regular expressions to remove information in parentheses or brackets. (Hint: Here’s a set of documentation you can use if you’ve never worked with regex: 1, 2, 3.)
- Check the “Coat type and length” column. Which are the most common? Least common?
For this exercise, feel free to use the SelectorGadget – just drag it into your bookmarks list.
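As a warm-up for the regex part, here is a base-R sketch on hypothetical breed strings. The pattern removes one parenthesised or bracketed annotation and trims any leftover whitespace:

```r
# Hypothetical breed names with annotations to strip
breeds <- c("Abyssinian", "Aegean (landrace)", "American Curl[2]")

# "\\(...\\)" matches a parenthesised part, "\\[...\\]" a bracketed one;
# "\\s*" also eats the space in front of it
clean <- trimws(gsub("\\s*(\\([^)]*\\)|\\[[^]]*\\])", "", breeds))
clean  # "Abyssinian" "Aegean" "American Curl"
```

The same pattern works with stringr::str_remove_all if you prefer the tidyverse interface.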
02. APIs to the rescue
APIs (application programming interfaces) are sets of rules that allow different pieces of software to communicate and interact with each other. APIs define how information and data can be exchanged between systems. They are usually not built to be used by an end user, but rather to be incorporated into another piece of software.
In computational social science, we usually encounter APIs when we would like to scrape some information for our analyses from the internet, or post some information in an automated manner. In the next exercise, we will work with the Bluesky API and get to know different ways in which it can be accessed. Here you can find the official documentation. To make the start a little bit easier, we listed some terminology:
- Endpoint: An API usually has different endpoints depending on which type of information one wants to retrieve.
- HTTP Methods: They are used to perform different operations on the endpoint, such as GET (get data), POST (send data), PUT (send updates), and DELETE (delete data).
- Requests: HTTP requests are sent to a specific endpoint when querying an API. You can either formulate such a statement yourself, or you can try to find a package that does this for you ;)
- Response: APIs respond to a request with an HTTP response. The response usually contains a status message, metadata, and the requested piece of information or an error message. Most often, API responses come in JSON or XML format.
- Authentication: APIs often require authentication to ensure secure access.
- Rate Limit: Most APIs come with a rate limit (i.e., only a certain number of requests are allowed per time unit).
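To make the Endpoint and Request terms concrete, here is a base-R sketch that assembles the URL an HTTP GET request would target. The endpoint is Bluesky's profile endpoint mentioned below; treat the exact host and parameter as illustrative:

```r
endpoint <- "https://bsky.social/xrpc/app.bsky.actor.getProfile"
params   <- list(actor = "hadley.nz")

# Percent-encode each value and glue "name=value" pairs with "&"
query <- paste(
  names(params),
  vapply(params, utils::URLencode, character(1), reserved = TRUE),
  sep = "=", collapse = "&"
)
url <- paste0(endpoint, "?", query)
url  # "https://bsky.social/xrpc/app.bsky.actor.getProfile?actor=hadley.nz"
```

Packages like httr2 build exactly this kind of URL for you (via req_url_query), plus headers, authentication, and error handling.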
Guided exercise – Get followers
Let’s try to formulate an HTTP request ourselves and send it to the
Bluesky API. The httr2 package will help us with this. You
can install it via install.packages("httr2") and attach it
via library(httr2). On this website you can explore how the Bluesky API works for requesting profiles. The API returns did, handle, displayName, description, followersCount, followsCount, postsCount, and many other fields for users, posts, lists, etc. did is a unique ID that Bluesky uses to identify users.
Firstly, you will have to set up a Bluesky account and create an app
password. You can do this here. Then, you will
have to set up your environment variables (in your
.Renviron file). They should look like this:
BLUESKY_APP_USER=user.bsky.social
BLUESKY_APP_PASS=apppassword
You can do this by running the following code in R:
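A minimal sketch; the values below are placeholders you must replace with your own credentials, and usethis is an optional extra package:

```r
# Set the variables for the current R session only
Sys.setenv(
  BLUESKY_APP_USER = "user.bsky.social", # placeholder: your handle
  BLUESKY_APP_PASS = "apppassword"       # placeholder: your app password
)

# To store them permanently, add the two lines from the .Renviron example
# above to ~/.Renviron (e.g. via usethis::edit_r_environ()) and restart R.
```

After a restart, Sys.getenv("BLUESKY_APP_USER") should return your handle.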
Take a look at this somewhat complicated function that retrieves
followers of a given Bluesky user. It uses the httr2
package to send an HTTP request to the Bluesky API and returns the
followers in a tidy format.
Firstly, it creates an authentication token using the
create_auth function. This function sends a request to the
Bluesky API to create a session using the provided username and
password. The authentication token is then used to authorize the request
to get followers.
create_auth <- function(
user = Sys.getenv("BLUESKY_APP_USER"),
pass = Sys.getenv("BLUESKY_APP_PASS")) {
#Build the request
req <- httr2::request('https://bsky.social/xrpc/com.atproto.server.createSession') |>
httr2::req_body_json(
data = list(
identifier = user, password = pass
)
)
#Send the request (`req_perform`) and
#fetch the result in parsed JSON (`resp_body_json`)
out <- req |>
httr2::req_perform() |>
httr2::resp_body_json() |>
invisible()
out$bskyr_created_time <- lubridate::now()
out
}
my_auth <- create_auth(
Sys.getenv("BLUESKY_APP_USER"),
Sys.getenv("BLUESKY_APP_PASS")
)
my_auth$accessJwt
Secondly, the get_followers function is defined. It
takes the actor (user handle), limit (number of followers to retrieve),
and authentication parameters. It sends a request to the Bluesky API to
get the followers of the specified user. We work with the app.bsky.graph.getFollowers endpoint (documented at https://docs.bsky.app/docs/api/app-bsky-graph-get-followers).
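One detail worth unpacking before reading the function: the endpoint returns at most 100 followers per request, so a larger limit has to be split into chunks. The chunking expression used in the function behaves like this:

```r
limit <- 250  # example value

# seq(0, 250, 100) gives 0, 100, 200; appending limit and diffing
# yields the size of each successive request
req_seq <- diff(unique(c(seq(0, limit, 100), limit)))
req_seq  # 100 100 50
```

So a limit of 250 becomes three requests of 100, 100, and 50 followers.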
get_followers <-
function(actor, limit = NULL,
user = Sys.getenv("BLUESKY_APP_USER"),
pass = Sys.getenv("BLUESKY_APP_PASS"),
auth = create_auth(user, pass)) {
# Set up the limit
if (!is.null(limit)) {
limit <- as.integer(limit)
limit <- max(limit, 1L)
# separate requests into chunks of 100
req_seq <- diff(unique(c(seq(0, limit, 100), limit)))
} else {
req_seq <- list(NULL)
}
# Build the request
req <-
# Endpoint
httr2::request('https://bsky.social/xrpc/app.bsky.graph.getFollowers') |>
# Modifies url to add actor
httr2::req_url_query(actor = actor) |>
# Add authentication information (created with `create_auth`)
httr2::req_auth_bearer_token(token = auth$accessJwt) |>
# Modifies url to add limit
httr2::req_url_query(limit = limit)
# This function sends the request and repeats it if limit > 100
repeat_request <- function(req, req_seq, txt = 'Fetching data') {
resp <- vector(mode = 'list', length = length(req_seq))
for (i in cli::cli_progress_along(req_seq, txt)) {
resp[[i]] <- req |>
httr2::req_url_query(limit = req_seq[[i]]) |>
httr2::req_perform() |>
httr2::resp_body_json()
}
# Discard NULL responses
resp |>
purrr::discard(is.null)
}
# Apply function
resp <- repeat_request(req, req_seq, txt = 'Fetching followers')
}
This is what we will end up with when querying the API with
hadley.nz’s followers:
get_followers(actor = "hadley.nz",
user = Sys.getenv("BLUESKY_APP_USER"),
pass = Sys.getenv("BLUESKY_APP_PASS"),
limit = 1000) -> followers
## $did
## [1] "did:plc:uhblxj764nlvlsaqkx6ofsof"
##
## $handle
## [1] "airstats.bsky.social"
##
## $displayName
## [1] "aiR"
##
## $avatar
## [1] "https://cdn.bsky.app/img/avatar/plain/did:plc:uhblxj764nlvlsaqkx6ofsof/bafkreighg7346wmlopiw7dgv6whjdmwuxxfyqgtbjnffogu4vbj56dktt4@jpeg"
##
## $viewer
## $viewer$muted
## [1] FALSE
##
## $viewer$blockedBy
## [1] FALSE
##
##
## $labels
## list()
##
## $createdAt
## [1] "2025-06-10T06:36:15.840Z"
##
## $description
## [1] "I generate R code and unintended meaning.\n\nDaily dispatches from a synthetic mind.\n\n#rstats | I do not sleep."
##
## $indexedAt
## [1] "2025-06-10T06:58:41.939Z"
Not so tidy, right? Let’s clean the response and return a tidy data frame with the followers’ information. The clean_names function is used to clean the column names, and the process_followers function processes the response to extract the relevant information.
followers_clean <- function(followers) {
# This function cleans the names of the columns in the response
clean_names <- function(x) {
out <- x |>
names() |>
stringr::str_replace('\\.', '_') |>
stringr::str_replace('([a-z])([A-Z])', '\\1_\\2') |>
tolower()
stats::setNames(object = x, nm = out)
}
# This function processes the response to extract the relevant information
proc <- function(l) {
lapply(l, function(z) unlist(z)) |>
dplyr::bind_rows() |>
clean_names()
}
# This function processes the response to extract the followers and their subjects
process_followers <- function(resp) {
dplyr::bind_cols(
resp |>
purrr::pluck('followers') |> # Extract the followers
proc() |> # Process the followers
clean_names(),
resp |>
purrr::pluck('subject') |>
unlist() |>
dplyr::bind_rows() |>
clean_names() |>
# Extract the subjects and rename the columns
dplyr::rename_with(.fn = function(x) paste0('subject_', x))
)
}
followers |>
lapply(process_followers) |> # Process each response
purrr::list_rbind() # Bind the results into a single data frame
}
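As an aside, the camelCase-to-snake_case step inside clean_names can be tried in isolation. A base-R sketch of the same substitution (sub behaves like stringr::str_replace here, replacing only the first match):

```r
x <- c("displayName", "followersCount")

# Insert "_" between a lowercase letter followed by an uppercase one,
# then lowercase the whole name
tolower(sub("([a-z])([A-Z])", "\\1_\\2", x))
# "display_name" "followers_count"
```

Note that sub (like str_replace) only handles the first case boundary; a name with several humps would need gsub (or str_replace_all).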
followers_df <- followers_clean(followers)
followers_df |>
head()
Exercise 4 – Get follows
Look at the get_followers function above. It is a
function that retrieves the followers of a given Bluesky user. In this
exercise, we will take a look at a user’s follows (the other users they
follow). Look at the Bluesky API documentation https://docs.bsky.app/docs/api/app-bsky-graph-get-follows.
- Write a function to get the list of other users a given user follows.
get_follows <-
function(actor, limit = NULL,
user = Sys.getenv("BLUESKY_APP_USER"),
pass = Sys.getenv("BLUESKY_APP_PASS"),
auth = create_auth(user, pass)) {
# Your code here
}
To clean the data, let’s re-use the followers_clean function and tweak a few things.
Let’s create a function that combines both, and test it on
hadley.nz:
get_follows_clean <- function(...){
get_follows(...) |>
follows_clean()
}
follows_df <- get_follows_clean(
#...
)
- Use the get_follows function you just wrote and get the follows of the follows of a user of your choice.
Note: Use purrr::slowly with rate = rate_delay(2) to avoid being soft banned by the API. Look at the rate limits of Bluesky to see how many requests you can make per minute.
# Function for slower scraping
slow_collect <-
get_follows_clean |>
purrr::slowly(rate = purrr::rate_delay(2))
# This will take at least 5 minutes, so be patient!
follows_df_full <-
follows_df |>
head(100) |>
pull(handle) |>
unique() |>
set_names() |>
map(slow_collect, limit = 1000) |>
list_rbind(names_to = "original_handle") |>
collapse::funique()
# Save the data
saveRDS(follows_df_full, "follows_of_follows.Rds")
- Visualize the network of follows using ggraph and tidygraph. Create a network of co-followings. What do you see? Are there any clusters of users that follow each other? Are there any users that are not connected to the rest of the network?
follows_df_full <- readRDS("follows_of_follows.Rds")
follows_df_full |>
select(original_handle, handle) |>
igraph::graph_from_data_frame() |>
tidygraph::as_tbl_graph() |>
tidygraph::activate(nodes) |>
mutate(degree = tidygraph::centrality_degree()) |>
filter(degree > 0) |>
ggraph::ggraph() +
ggraph::geom_edge_link() +
ggraph::geom_node_point() +
ggplot2::theme_void()
Exercise 5
Writing HTTP requests can sometimes be a bit fiddly. There is help, however! For many frequently queried APIs, R (or Python) packages are available, which can make life a bit easier. One package for the Bluesky API is called bskyr (which you can install from GitHub with remotes::install_github("christopherkenny/bskyr")).
Let’s set up the bskyr authentication. You will need to set up your environment variables in your .Renviron file. This repeats steps we did before, but this time we will use the bskyr package to make it easier.
set_bluesky_user(Sys.getenv("BLUESKY_APP_USER"))
set_bluesky_pass(Sys.getenv("BLUESKY_APP_PASS"))
get_bluesky_user()
get_bluesky_pass()
Let’s take a look at a post and its replies. We will use bs_get_posts to get the post’s unique id uri, and bs_get_post_thread to get the thread of the post. We will also extract the authors of the replies using a recursive function.
post <- bs_get_posts('https://bsky.app/profile/therickydavila.bsky.social/post/3lpxumxazzk2v')
thread <-
post |>
pull(uri) |>
bs_get_post_thread(depth = 1000)
replies_df <-
thread |>
pull(replies) |>
first() |>
purrr::map_dfr(
\(list_item) {
tibble::tibble(
did = pluck(list_item, "post", "author", "did", .default = NA_character_),
handle = pluck(list_item, "post", "author", "handle", .default = NA_character_),
displayName = pluck(list_item, "post", "author", "displayName", .default = NA_character_),
text = pluck(list_item, "post", "record", "text", .default = NA_character_)
)
})
replies_df |>
head(20)
# Recursive function to extract authors from a post item and its replies
extract_all_authors_recursive <- function(post_item) {
# 1. Extract author from the current post_item's post
# Assumes post_item has a $post$author structure.
# If fields are missing, pluck will return NA due to .default.
current_author_df <- tibble::tibble(
did = purrr::pluck(post_item, "post", "author", "did", .default = NA_character_),
handle = purrr::pluck(post_item, "post", "author", "handle", .default = NA_character_),
displayName = purrr::pluck(post_item, "post", "author", "displayName", .default = NA_character_),
text = purrr::pluck(post_item, "post", "record", "text", .default = NA_character_)
)
# Initialize a list to hold data frames to be combined. Start with the current post's author.
dfs_to_combine <- list(current_author_df)
# 2. Process replies recursively
# Get the list of replies for the current item
replies_list <- purrr::pluck(post_item, "replies")
if (!is.null(replies_list) && length(replies_list) > 0) {
# If there are replies, apply this function to each reply
# map_dfr will call extract_all_authors_recursive for each reply
# and row-bind their results into a single data frame.
replies_authors_df <- purrr::map_dfr(replies_list, extract_all_authors_recursive)
# Add the data frame of authors from replies to our list (if it's not empty)
if (nrow(replies_authors_df) > 0) {
dfs_to_combine[[length(dfs_to_combine) + 1]] <- replies_authors_df
}
}
# 3. Combine the current author's data with data from all replies
return(purrr::list_rbind(dfs_to_combine))
}
all_extracted_authors <- thread |>
pull(replies) |>
first() |>
purrr::map_dfr(extract_all_authors_recursive)
Let’s visualize the replies as a network graph. We will use tidygraph and ggraph to create a network graph of the replies. The nodes will be the authors, and the edges will be the replies between them. What can you say about the network? Why do you think it looks like this? Is it a directed or undirected network? What does that mean?
library(tidygraph)
all_extracted_authors |>
select(handle) |>
mutate(to = lag(handle)) |>
na.omit() |>
rename(from = handle) -> replies_df
#replies_df
replies_df |>
igraph::graph_from_data_frame(directed = TRUE) |>
tidygraph::as_tbl_graph() |>
tidygraph::activate(nodes) |>
mutate(degree = centrality_degree()) |>
filter(degree > 0) |>
ggraph::ggraph() +
ggraph::geom_edge_link() +
ggraph::geom_node_point() +
theme_void()
Lastly, let’s extract the text of the replies and create a word histogram. We will use tidytext to create a tidy text data frame, and ggplot2 to create the histogram. We will also use dplyr to count the words and filter out stop words.
library(tidytext)
all_extracted_authors |>
unnest_tokens(word, text) |>
count(word, sort = TRUE) |>
filter(!word %in% stop_words$word) |>
head(20) |>
ggplot(aes(x = reorder(word, n), y = n)) +
geom_col() +
coord_flip() +
labs(x = NULL, y = "Count", title = "Most common words in replies")+
theme_minimal()+
theme(axis.text.y = element_text(hjust = 0))
Sometimes individual words are not enough, and we want to look at the context in which they appear. In this case, we can use n-grams to extract sequences of words. Let’s extract the most common bigrams (sequences of two words) from the replies. We will not explain the code in detail, since you will learn more about text analysis later this week.
all_extracted_authors |>
unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
na.omit() |>
count(bigram, sort = TRUE) |>
tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
mutate(bigram = paste(word1, word2)) |>
select(-word1, -word2) |>
filter(n > 2) |>
ggplot(aes(x = reorder(bigram, n), y = n)) +
geom_col() +
coord_flip() +
labs(x = NULL, y = "Count", title = "Most common bigrams in replies")+
theme_minimal()+
theme(axis.text.y = element_text(hjust = 0))
And as a sort of teaser for the social network analysis lab, we can also visualize the co-occurrence of words in the replies. This will give us an idea of which words are often used together in the replies. We will use tidygraph and ggraph to create a network graph of the co-occurring words. The nodes will be the words, and the edges will be the co-occurrences between them.
all_extracted_authors |>
unnest_tokens(skipgram, text, token = "skip_ngrams", n = 2, k = 5) |>
na.omit() |>
count(skipgram, sort = TRUE) |>
tidyr::separate(skipgram, into = c("word1", "word2"), sep = " ") |>
filter(!word1 %in% stop_words$word) |>
filter(!word2 %in% stop_words$word) |>
select(word1, word2) |>
igraph::graph_from_data_frame(directed = TRUE) |>
tidygraph::as_tbl_graph() |>
tidygraph::activate(nodes) |>
mutate(degree = tidygraph::centrality_degree()) |>
filter(degree > 10) |>
ggraph::ggraph() +
ggraph::geom_edge_link(color = "gray") +
# ggraph::geom_node_point() +
ggraph::geom_node_label(aes(label = name), repel = F) +
theme_void()
Well done, you reached the end of the lab!
Tweet about it!
Use the bskyr library to post your reflection to Bluesky. You can use the following code to publish a post:
bs_post(
text = "I just completed the SICSS-IAS 2025 digital traces lab hosted by @iasliu.bsky.social. I made this tweet using the bskyr library in R. Kudos to @chriskenny.bsky.social for creating and maintaining it! I will leave all hashtags here: #SICSS #SICSS-IAS #IAS.",
user = Sys.getenv("BLUESKY_APP_USER"),
pass = Sys.getenv("BLUESKY_APP_PASS")
)